Figure 1: Race and Age of Getting Diabetes

Figure 1 shows the age at which participants were diagnosed with diabetes, broken down by race, as a set of histograms. Before plotting, we removed values above 100 and set the bin width to 10 years, which tidies the data and makes the overall trend easier to read. We used faceting to draw one panel per race. In every panel the number of diagnoses rises gradually with age, jumps sharply between roughly 35-40 and 50, and declines gradually after about 55. Non-Hispanic White participants account for the largest number of diagnoses among the six groups (Hispanic, Other Hispanic, non-Hispanic White, non-Hispanic Black, non-Hispanic Asian, and non-Hispanic multiracial), and diagnoses are most frequent at around age 50. We therefore conclude that, in this sample, non-Hispanic White is the group most often diagnosed with diabetes and that people are most likely to be diagnosed at approximately age 50.

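# Recode the numeric RIDRETH3 race/ethnicity codes into readable labels.
# Note: the NHANES codebook labels code 1 as Mexican American; this report calls it Hispanic.
demo2 <- demo
demo2$RIDRETH3 <- factor(demo2$RIDRETH3)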
race_ethnicity <- demo2 %>%
  select(SEQN,RIDRETH3) %>%
  mutate(race = fct_recode(RIDRETH3,
             "non-Hispanic multiracial" = "7",
             "non-Hispanic Asian" = "6",
             "non-Hispanic Black" = "4",
             "non-Hispanic White" = "3",
             "Other Hispanic" = "2",
             "Hispanic" = "1"))

# DID040: age when a doctor first told the participant they had diabetes.
# Values of 100 or more are not real ages (NHANES uses codes such as 999 for "don't know"), so they are dropped.
age_diabetes <- diab %>%
  select(SEQN, DID040) %>%
  drop_na() %>%
  mutate(age = DID040) %>%
  filter(age < 100)

# Inner join the race and age tables on their common column, SEQN.
race_age <- inner_join(race_ethnicity, age_diabetes, by = "SEQN") %>%
  select(race, age)

# A bin width of 10 shows the trend more clearly.
# Non-Hispanic White has the most diagnoses of all the races, and most diagnoses occur at around age 50.
plot <- ggplot(data = race_age, aes(x = age)) +
  geom_histogram(binwidth = 10) +
  facet_wrap(~race)

ggplotly(plot)
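
The panel heights in Figure 1 are raw counts, so they also depend on how many participants of each race are in the sample. The following is a minimal sketch, on made-up toy data rather than NHANES, of how the 10-year binning works and how counts could be turned into within-race shares. The data frame `toy`, its values, and the bin edges are assumptions for illustration; `geom_histogram()` may place its bin edges differently.

```r
# Toy data (hypothetical values, not NHANES): race and age at diagnosis.
toy <- data.frame(
  race = c("A", "A", "A", "A", "B", "B"),
  age  = c(45, 52, 58, 999, 51, 33)   # 999 mimics a non-age response code
)

# Drop non-age codes, mirroring filter(age < 100).
toy <- toy[toy$age < 100, ]

# Assign each age to a 10-year bin: [30,40), [40,50), [50,60).
toy$age_bin <- cut(toy$age, breaks = seq(30, 60, by = 10), right = FALSE)

# Raw counts per race and bin (what a facetted histogram displays).
counts <- table(toy$race, toy$age_bin)

# Share of each race's diagnoses that falls in each bin (each row sums to 1).
shares <- prop.table(counts, margin = 1)

print(counts)
print(round(shares, 2))
```

Comparing `shares` rather than `counts` is one way to compare the shape of the age distribution between groups of different size.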

Figure 2: Triglyceride Level by Race

A box plot shows the relationship between race and triglyceride level. Values above 150 mg/dL were removed as outliers so that the trend is easier to see. Race does not have a strong effect on triglyceride level, but there are minor differences: non-Hispanic Black participants have the lowest level on average and Other Hispanic participants the highest. Figure 3 shows that average triglyceride level is highest at around age 50, the same age range in which diabetes diagnoses are most frequent in Figure 1. If a high triglyceride level alone explained vulnerability to diabetes, Other Hispanic participants would be the most prone to diabetes, but Figure 1 does not show this. Factors other than triglyceride level therefore probably contribute to the rise in diagnoses around age 50 to 55. Studies have found that the relationship runs in both directions: high triglycerides may increase the risk of diabetes, and diabetes in turn raises triglyceride levels. People with diabetes who have high triglycerides are also at greater risk of heart attack or stroke than those with normal triglyceride levels (Murad et al., 2012).

# Inner join the race and triglyceride tables on their common column, SEQN.
race_trig <- inner_join(race_ethnicity, trigly, by = "SEQN") %>%
  select(race, LBXTR)

# Keep values of 150 mg/dL or less so the trend is visible, then draw one box per race.
race_trig %>%
  filter(LBXTR <= 150) %>%
  ggplot(aes(x = race, y = LBXTR)) +
  geom_boxplot() +
  ylab("Triglyceride Level (mg/dL)") +
  labs(title = "Triglyceride Level by Race")
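
Because the 150 mg/dL cutoff removes the upper tail of each group, it can shift the group medians that the box plot displays. A minimal sketch on made-up toy data (the data frame `toy` and its values are hypothetical, not NHANES) of how the per-group median could be checked with and without the cutoff:

```r
# Toy data (hypothetical values): triglyceride level in mg/dL for two groups.
toy <- data.frame(
  race  = c("A", "A", "A", "B", "B", "B"),
  LBXTR = c(60, 90, 400, 80, 120, 140)
)

# Median per group using every observation.
median_all <- tapply(toy$LBXTR, toy$race, median)

# Median per group after applying the same cutoff as the box plot.
kept <- toy[toy$LBXTR <= 150, ]
median_cut <- tapply(kept$LBXTR, kept$race, median)

print(median_all)   # A = 90, B = 120
print(median_cut)   # A = 75, B = 120
```

In this toy case only group A contains a value above the cutoff, so only its median changes.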

Figure 3: Average Triglyceride Level by Age

The figure “Average Triglyceride Level (mg/dL) by Age” plots the average triglyceride level at each age for all participants in the 2017-2018 NHANES. All participants whose triglyceride level was measured are 12 years old or older, and because of the way NHANES records age, everyone aged 80 or above is top-coded as 80. The trendline is highest at around age 45, where it exceeds 125 mg/dL, and lowest at around age 12, where it is below 75 mg/dL; it rises slowly from age 12 to 45 and then falls slowly to about 120 mg/dL. In the underlying data, the highest average occurs at age 51 (200.875 mg/dL). The difference between this value and the peak of the trendline probably arises because triglyceride levels are generally high at the ages around 45, which raises the smoothed average over that range. Overall, the highest triglyceride levels occur between about 45 and 55 years of age. Triglyceride level tends to rise with age because it is elevated by higher consumption of carbohydrates, sugary foods, and alcohol, as well as by uncontrolled obesity (Healthwise, 2020). With irregular diets and daily routines, people aged 45 to 55 tend to have the highest average triglyceride level and are therefore at higher risk of type 2 diabetes and related conditions such as metabolic syndrome, obesity, heart disease, and stroke (Tirosh et al., 2008).

# LBXTR: triglyceride level (mg/dL).
# DID040: age at which the participant was diagnosed with diabetes.
# Create a table containing the sequence number and the triglyceride level.
trig <- trigly %>%
  select(SEQN, LBXTR) %>% 
  drop_na()

#create a table containing the age of each participant.
age_screening <- demo %>%  #demographic
  select(SEQN, RIDAGEYR) %>%
  drop_na() %>%
  filter(RIDAGEYR < 100) 

# Join the triglyceride and age tables by sequence number, keeping only observations present in both.
trig_age <- inner_join(trig, age_screening, by = "SEQN") %>%
  mutate(age = RIDAGEYR, triglyceride = LBXTR) %>%
  select(SEQN, age, triglyceride)

# Scatterplot of the average triglyceride level at each age, with a smoothed trendline.
trig_age %>%
  group_by(age) %>%
  summarise(average_trig = mean(triglyceride)) %>%  # average triglyceride level at each age
  ggplot(aes(x = age, y = average_trig)) +
  geom_point() +
  geom_smooth() +
  xlab("Age") +
  ylab("Average Triglyceride Level (mg/dL)") +
  labs(title = "Average Triglyceride Level (mg/dL) by Age")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
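
The age with the highest average can be read directly from the per-age means rather than from the plot. A minimal sketch on made-up toy data (the data frame `toy` and its values are hypothetical, not NHANES) that computes the per-age mean, locates its maximum, and counts the observations behind each mean, since an average based on few participants is less stable than the smoothed trendline:

```r
# Toy data (hypothetical values): age and triglyceride level in mg/dL.
toy <- data.frame(
  age          = c(12, 12, 45, 45, 51, 51, 80),
  triglyceride = c(60, 70, 120, 140, 150, 250, 110)
)

# Mean triglyceride level at each age.
avg_by_age <- aggregate(triglyceride ~ age, data = toy, FUN = mean)
names(avg_by_age)[2] <- "average_trig"

# Row holding the highest per-age mean.
peak <- avg_by_age[which.max(avg_by_age$average_trig), ]

# Number of observations contributing to each per-age mean.
n_by_age <- table(toy$age)

print(avg_by_age)
print(peak)       # age 51, average_trig 200
print(n_by_age)
```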

Figure 4: Average Triglyceride Level by Age in Different Race

Combining the race and age analyses, we produced the animated figure “Average Triglyceride Level (mg/dL) by Age in Different Race”, which shows the average triglyceride level at each age for each race. The fixed trendline shows the overall trend in average triglyceride level across all participants. Non-Hispanic Black participants fall below the overall average at nearly all ages, while Hispanic and Other Hispanic participants are generally above it in all age groups. Non-Hispanic Asian participants have higher average levels from about 57 to 62 years of age, and non-Hispanic White participants from about 42 to 50. The differences between racial groups might reflect differences in diet and ethnic differences in lipid levels (Sumner, 2009). In summary, the figure suggests a relatively higher risk of type 2 diabetes among Hispanic participants, and a higher average risk of elevated triglycerides, and therefore of type 2 diabetes, among the middle-aged population (40 to 60 years old).

#trig level by age in different race
demo$race_char <- as.character(demo$RIDRETH3)

race_ethnicity <- demo %>%
  select(SEQN,race_char, RIDRETH3) %>%
  mutate(Race = fct_recode(race_char,
    "non-Hispanic multiracial" = "7",
    "non-Hispanic Asian" = "6",
    "non-Hispanic Black" = "4",
    "non-Hispanic White" = "3",
    "Other Hispanic" = "2",
    "Hispanic" = "1" ))


# Join the triglyceride-by-age table with the race table.
age_race_trig <- inner_join(trig_age, race_ethnicity, by = "SEQN")

# Average triglyceride level for each combination of race and age.
avg_age_race_trig <- age_race_trig %>%
  group_by(Race, age) %>%
  summarise(average_triglyceride = mean(triglyceride))
## `summarise()` has grouped output by 'Race'. You can override using the `.groups` argument.
#animate scatterplot for age, race, trig
p2 <- ggplot(data = avg_age_race_trig, aes(x = age, y = average_triglyceride)) +
  # `frame` is not a ggplot2 aesthetic (hence the warning below); ggplotly() uses it to build the animation frames.
  geom_point(aes(color = Race, size = average_triglyceride, frame = age)) +
  geom_smooth() +
  xlab("Age") +
  ylab("Average Triglyceride Level (mg/dL)") +
  labs(title = "Average Triglyceride Level (mg/dL) by Age in Different Race") +
  theme(plot.title = element_text(size = 12))
## Warning: Ignoring unknown aesthetics: frame
ggplotly(p2) %>%
  animation_opts(transition = 500, easing = "linear", mode = "immediate")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
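
Statements such as "higher than the average of all participants" can also be checked numerically. A minimal sketch on made-up toy data (the data frame `toy` and its values are hypothetical, not NHANES) that subtracts an overall per-age mean from each race-by-age mean. It assumes a plain per-age mean as the baseline, which is simpler than the loess trendline drawn in the figure:

```r
# Toy data (hypothetical values): race, age, and triglyceride level in mg/dL.
toy <- data.frame(
  Race         = c("A", "A", "B", "B", "A", "B"),
  age          = c(40, 40, 40, 40, 50, 50),
  triglyceride = c(100, 120, 80, 60, 150, 110)
)

# Mean triglyceride level for each race at each age.
by_race_age <- aggregate(triglyceride ~ Race + age, data = toy, FUN = mean)

# Mean triglyceride level of all participants at each age (assumed baseline).
overall <- aggregate(triglyceride ~ age, data = toy, FUN = mean)
names(overall)[2] <- "overall_mean"

# Positive diff: the group is above the overall mean at that age.
cmp <- merge(by_race_age, overall, by = "age")
cmp$diff <- cmp$triglyceride - cmp$overall_mean

print(cmp)
```

In this toy case group A is 20 mg/dL above the baseline at both ages and group B is 20 mg/dL below it.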

Figure 5: Blood Pressure and Cholesterol Level by Age

This graph compares the ages of participants who report high blood pressure with the ages of those who report high cholesterol, to see whether one tends to occur before the other, given that both indicators are linked with diabetes. The result shows that both occur at around the same time, mostly between 50 and 70 years of age. Since our analysis focuses on diabetes, it is important to examine key diabetes precursors and the relationship between them.

bloodpres <- bloodpressure %>%
  select(SEQN, BPQ020,BPQ080) %>%
  drop_na()
names(bloodpres) <- c("SEQN", "High Blood Pressure", "High Cholesterol")

# Reshape to long format: one row per participant and risk factor.
bloodpresNew <- pivot_longer(
  bloodpres,
  c(`High Blood Pressure`, `High Cholesterol`),
  names_to = "Heart Disease Risk Factors",
  values_to = "Have it or not"
)

age_screening <- demo %>%
  select(SEQN, RIDAGEYR) %>%
  drop_na() %>%
  filter(RIDAGEYR < 100) 

# Attach age, keep only participants who answered yes (code 1), and keep only the columns needed for the plot.
bloodpres_age <- inner_join(bloodpresNew, age_screening, by = "SEQN") %>%
  filter(`Have it or not` == 1) %>%
  select(`Heart Disease Risk Factors`, age = RIDAGEYR)

ggplot(data = bloodpres_age, mapping = aes(x = `Heart Disease Risk Factors`, y = age)) +
  geom_boxplot() +
  labs(title="Diabetes Factors by Age")
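
The box plot compares the age distribution of participants who report each condition. A minimal sketch on made-up toy data (the data frame `toy`, its column names, and its values are hypothetical, not NHANES) of the same reshape-then-filter idea in base R, followed by the median age per risk factor:

```r
# Toy data (hypothetical values): age and yes/no answers coded 1 = yes, 2 = no.
toy <- data.frame(
  age       = c(35, 55, 62, 70, 48),
  high_bp   = c(2, 1, 1, 1, 2),
  high_chol = c(1, 2, 1, 1, 1)
)

# Stack the two answer columns into long format: one row per person and risk factor.
long <- rbind(
  data.frame(risk_factor = "High Blood Pressure", answer = toy$high_bp,   age = toy$age),
  data.frame(risk_factor = "High Cholesterol",    answer = toy$high_chol, age = toy$age)
)

# Keep only the "yes" answers, then take the median age for each risk factor.
yes <- long[long$answer == 1, ]
median_age <- tapply(yes$age, yes$risk_factor, median)

print(median_age)   # High Blood Pressure = 62, High Cholesterol = 55
```

Similar medians for the two risk factors would be consistent with the reading that both appear at around the same age.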
